Analysis

The analysis pipelines used by IGSR build on those created for the 1000 Genomes Project. For more detailed information about the analysis methods used by the 1000 Genomes Project in its different phases, please refer to our publications.

Publications

Alignment

GRCh38 with alternative sequences, plus decoys and HLA

IGSR uses an alt-aware alignment strategy, with the most recent version of BWA-MEM, when aligning data to GRCh38. This uses the full GRCh38 reference, including ALT contigs, decoy and EBV sequences (accession GCA_000001405). In addition, more than 500 HLA sequences, compiled by Heng Li from the IMGT/HLA database provided by the Immuno Polymorphism Database (IPD), are included.

The pipeline aligns sequence data at the run level and then merges runs belonging to the same sample to produce sample-level alignments. GATK BAM improvement steps are used, as in the 1000 Genomes phase 3 pipeline. Using the complete GRCh38 genome should improve read mapping accuracy, providing a better foundation for further analyses.
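The run-to-sample merge described above can be planned with a simple grouping step. The following is a minimal sketch, not the IGSR pipeline code: the function name, the record layout and the sample and run identifiers are all hypothetical, and the actual merging of BAM files is left to whichever tool the pipeline uses.

```python
from collections import defaultdict


def group_runs_by_sample(run_records):
    """Map each sample ID to the sorted list of its run-level BAM paths.

    run_records is an iterable of (sample_id, run_id, bam_path) tuples.
    This layout is an assumption made for the sketch.
    """
    by_sample = defaultdict(list)
    for sample_id, _run_id, bam_path in run_records:
        by_sample[sample_id].append(bam_path)
    # Sort so that the merge input order is reproducible.
    return {sample: sorted(paths) for sample, paths in by_sample.items()}


# Hypothetical identifiers, for illustration only.
runs = [
    ("SAMPLE_A", "RUN_001", "RUN_001.bam"),
    ("SAMPLE_B", "RUN_002", "RUN_002.bam"),
    ("SAMPLE_A", "RUN_003", "RUN_003.bam"),
]
merge_plan = group_runs_by_sample(runs)
```

Each entry in merge_plan lists the run-level files that would be merged into one sample-level alignment.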

Information on alt-aware BWA can be found on the bwa site.
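As described in the bwa ALT mapping notes, BWA-MEM switches to alt-aware behaviour when a file named after the index prefix with an added ".alt" suffix is present alongside the index. A small pre-flight check along these lines can catch a missing file before alignment starts. This is a sketch under that assumption; the function name and the example path are hypothetical.

```python
import os


def has_alt_file(index_prefix):
    """Return True if '<index_prefix>.alt' exists next to the BWA index.

    Assumption: the presence of this file is what enables alt-aware
    mapping in BWA-MEM; check the bwa documentation for your version.
    """
    return os.path.isfile(index_prefix + ".alt")


# Hypothetical usage: stop early if the ALT definition file is missing.
# if not has_alt_file("/path/to/reference.fa"):
#     raise SystemExit("No .alt file found; alignment would not be alt-aware")
```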

GRCh37

During the main 1000 Genomes Project, sequence reads were aligned to GRCh37. In phase 1, the reference as provided by the Genome Reference Consortium was used; in phase 3, decoy sequence was added to the reference to reduce the rate of mismapping.

The phase 1 reference FASTA can be found in the technical/reference directory. It represents the full chromosomes of the GRCh37 build of the human reference. The phase 3 reference can be found in the phase2_reference_assembly_sequence directory. This contains both the full reference and the additional decoy sequence.

NCBI36

In the pilot phase of the 1000 Genomes Project, the data was mapped to sex-matched copies of NCBI36. Our reference files can be found under the pilot_data directory.

Mapping Algorithms

During the 1000 Genomes Project, different mapping algorithms were used for different data types. The table below describes which algorithms were used for each combination of data type and technology in the different phases of the project.

Phase	Technology	Low Coverage	Exome/Exon Targeted	High Coverage
Pilot	Illumina	MAQ	MOSAIK	MAQ
Pilot	454	MAQ	MOSAIK	MAQ
Pilot	SOLiD	corona	N/A	corona
Phase 1	Illumina	bwa	bfast	N/A
Phase 1	454	bwa	bfast	N/A
Phase 1	SOLiD	bfast	bfast	N/A
Phase 3	Illumina	bwa	N/A	N/A

MAQ: http://maq.sourceforge.net/
Corona Lite: (no longer available as of 2018)
MOSAIK: http://code.google.com/p/mosaik-aligner/
BWA: http://bio-bwa.sourceforge.net/
bfast: http://sourceforge.net/projects/bfast/
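For scripted use, the table above can be held as a lookup keyed by phase and technology. The sketch below transcribes the table as it appears on this page, with N/A represented as None; the names MAPPERS and mapper_for and the data type keys are hypothetical.

```python
# Transcribed from the table above; None stands for N/A.
MAPPERS = {
    ("Pilot", "Illumina"): {"low_coverage": "MAQ", "exome": "MOSAIK", "high_coverage": "MAQ"},
    ("Pilot", "454"): {"low_coverage": "MAQ", "exome": "MOSAIK", "high_coverage": "MAQ"},
    ("Pilot", "SOLiD"): {"low_coverage": "corona", "exome": None, "high_coverage": "corona"},
    ("Phase 1", "Illumina"): {"low_coverage": "bwa", "exome": "bfast", "high_coverage": None},
    ("Phase 1", "454"): {"low_coverage": "bwa", "exome": "bfast", "high_coverage": None},
    ("Phase 1", "SOLiD"): {"low_coverage": "bfast", "exome": "bfast", "high_coverage": None},
    ("Phase 3", "Illumina"): {"low_coverage": "bwa", "exome": None, "high_coverage": None},
}


def mapper_for(phase, technology, data_type):
    """Return the aligner named in the table, or None if N/A or not listed."""
    row = MAPPERS.get((phase, technology))
    if row is None:
        return None
    return row.get(data_type)
```

For example, mapper_for("Phase 1", "SOLiD", "low_coverage") looks up the Phase 1 SOLiD row and returns the entry from its Low Coverage column.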

SNP calling in the 1000 Genomes Project

Over the course of the 1000 Genomes Project, the way variants were called from the samples changed quite dramatically. Two clear lessons from the project were that, when considering low coverage data, calling from multiple samples at once produces more, higher quality variants, and that considering the sites discovered by multiple algorithms improves both the discovery rate and the accuracy of discovery. Many different programs and strategies were developed over the duration of the project. The publications referred to at the top of this page are the best place to find a description of which programs were used and how the 1000 Genomes variant calling pipeline was run.
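The idea of combining sites from several algorithms can be illustrated with a simple vote across call sets. This is only a toy sketch and not the method used by the 1000 Genomes Project, which is described in the publications; the function name, the min_callers threshold, the caller names and the example sites are all hypothetical.

```python
from collections import Counter


def consensus_sites(call_sets, min_callers=2):
    """Return the sites reported by at least min_callers call sets.

    call_sets maps a caller name to an iterable of sites, where a site is
    a (chrom, pos, ref, alt) tuple. With min_callers=1 this is the union
    of all sites; higher values trade discovery for agreement.
    """
    counts = Counter()
    for sites in call_sets.values():
        # set() ensures each caller votes at most once per site.
        counts.update(set(sites))
    return {site for site, n in counts.items() if n >= min_callers}


# Hypothetical call sets, for illustration only.
calls = {
    "caller_a": [("1", 100, "A", "G"), ("1", 200, "C", "T")],
    "caller_b": [("1", 100, "A", "G"), ("2", 300, "G", "A")],
    "caller_c": [("1", 100, "A", "G"), ("1", 200, "C", "T")],
}
shared = consensus_sites(calls, min_callers=2)
```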
